Introduce atomic slot migration #1949
Merged: madolson merged 125 commits into valkey-io:unstable from enjoy-binbin:slot_migration_murphyjacob4 on Aug 12, 2025.
Conversation
1. Define new structures slotRange and clusterSlotSyncLink; 2. Add a CLUSTER SLOTLINK command to manage all the slot sync links. Signed-off-by: Binbin <[email protected]>
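A rough sketch of what these two structures could look like; the type names come from the commit above, but every field here is a guess for illustration, not the actual definition:
```c
/* Inclusive range of hash slots, e.g. {0, 5460}. */
typedef struct slotRange {
    int start_slot;
    int end_slot;
} slotRange;

/* One slot sync link, managed via CLUSTER SLOTLINK. Fields are
 * illustrative; the real structure lives in the cluster implementation. */
typedef struct clusterSlotSyncLink {
    connection *conn;   /* transport to the peer node */
    slotRange *ranges;  /* slot ranges carried over this link */
    int num_ranges;
    int state;          /* handshake, RDB transfer, or streaming */
    long long last_io;  /* for timeout detection in the cron */
} clusterSlotSyncLink;
```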
1. Add CLUSTER SLOTSYNC/SLOTSYNCFORCE commands to trigger slot sync. Signed-off-by: Binbin <[email protected]>
1. Extend the SYNC command to let it specify slot ranges; 2. Enable filtering of keys in the specified slots when generating the RDB; 3. Implement the handshake process before RDB transfer for slot sync. Signed-off-by: Binbin <[email protected]>
1. Implement the RDB transfer and loading for slot sync. Signed-off-by: Binbin <[email protected]>
1. Enable filtering of commands in the specified slots when feeding replicas; 2. Implement the message exchange for slot sync; 3. Add clusterSlotSyncCron() to handle time events for slot sync. Signed-off-by: Binbin <[email protected]>
1. Add CLUSTER FAILOVER command to trigger slot failover; 2. Implement the process of slot failover. Signed-off-by: Binbin <[email protected]>
1. Improve delDbKeysInSlot() to support a time limit; 2. Implement slot pending delete. Signed-off-by: Binbin <[email protected]>
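A sketch of the time-limited deletion idea, assuming a hypothetical per-slot key iterator; the real delDbKeysInSlot() may be shaped differently:
```c
/* Delete keys from `slot` until `budget_us` microseconds have elapsed.
 * Returns the number of keys deleted; the cron calls this repeatedly,
 * so a partially deleted slot stays "pending delete" between ticks. */
long long delDbKeysInSlot(serverDb *db, int slot, long long budget_us) {
    long long deleted = 0;
    long long start = ustime();
    robj *key;
    while ((key = nextKeyInSlot(db, slot)) != NULL) { /* hypothetical iterator */
        dbDelete(db, key);
        deleted++;
        /* Check the clock only every 128 deletions to keep it cheap. */
        if ((deleted & 127) == 0 && ustime() - start > budget_us) break;
    }
    return deleted;
}
```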
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
…ot RDBs In slot sync, we can load multiple slot RDBs from different nodes. Slot RDBs will contain functions, which may hit an `already exists` error when loading, since every node has the function data. This is a limitation of our slot RDB design. This commit adds a new RDBFLAGS_SLOT_SYNC flag as a hint that we are loading a slot RDB. With this flag, loading a function is treated like FUNCTION LOAD REPLACE, so there are no errors and the function loaded last wins (which is not a problem at the moment, since all the functions are the same across the cluster). Signed-off-by: Binbin <[email protected]>
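A minimal sketch of that hint; apart from RDBFLAGS_SLOT_SYNC and the FUNCTION LOAD REPLACE semantics named in the commit, the helper names are illustrative:
```c
/* When loading a function library from a slot RDB, allow replacement:
 * every node ships the same libraries, so "last one wins" is safe. */
static int rdbLoadFunctionLibrary(const char *code, int rdbflags, sds *err) {
    int replace = (rdbflags & RDBFLAGS_SLOT_SYNC) != 0;
    /* Behaves like FUNCTION LOAD [REPLACE]: with replace=1, an existing
     * library with the same name is overwritten instead of erroring. */
    return functionsLoadLibrary(code, replace, err); /* hypothetical helper */
}
```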
We have three nodes: A and B are doing the slot sync, where A is the src node and B is the dst node. Before we do the CLUSTER SLOTFAILOVER, we add a new node C as a replica of B. At this point, if the connection between A and B breaks, then from B's point of view, freeClient will call onSlotSyncClientClose to reset the link, and B will try a new slot SYNC in cron. B will then call delkeysNotOwnedByMySelf to delete all the keys in that slot, which propagates to C, and C will become an empty DB. When B does the new slot SYNC, A generates a new slot RDB and B reloads it. But this new RDB file is not propagated to node C, because the primary-replica connection between B and C is still healthy, which results in data loss on C. In this commit, after the primary loads the slot RDB, we disconnect all replicas and let them fully synchronize (we don't support slot psync). Signed-off-by: Binbin <[email protected]>
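A minimal sketch of the fix described above, assuming a disconnectSlaves()-style helper; the exact call site and names are illustrative:
```c
/* After the target primary finishes loading a slot RDB, force all of its
 * replicas to reconnect and full-sync: slot psync is not supported, so
 * their replication streams no longer match the loaded data set. */
if (rdbflags & RDBFLAGS_SLOT_SYNC) {
    serverLog(LL_NOTICE,
              "Slot RDB loaded; disconnecting replicas to force full sync");
    disconnectSlaves(); /* replicas reconnect and request a full SYNC */
}
```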
In the SET command, the expire time is rewritten into a new
robj, which is mostly an INT:
```
/* Propagate as SET Key Value PXAT millisecond-timestamp if there is
 * EX/PX/EXAT/PXAT flag. */
robj *milliseconds_obj = createStringObjectFromLongLong(milliseconds);
rewriteClientCommandVector(c, 5, shared.set, key, val, shared.pxat, milliseconds_obj);
```
Then, if we are doing resharding, the server will crash.
That is because we call isCommandInSlotRanges to check
the slot, which calls getKeysFromCommand to get the keys
from the command argv. The milliseconds_obj is an INT, but the code
in setGetKeys treats it as a STRING, so the server crashes.
In this fix, we decided that in isCommandInSlotRanges, if we
find an INT, we decode it to a STRING.
In addition, we added a new debug option enable-debug-assert so that
we can cover the isCommandInSlotRanges function in our
TCL tests.
Signed-off-by: Binbin <[email protected]>
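A simplified sketch of that fix: normalize any INT-encoded argv objects to plain strings before key extraction. getDecodedObject() and OBJ_ENCODING_INT are standard helpers; the surrounding lookup logic is omitted:
```c
/* Inside isCommandInSlotRanges() (simplified): command rewrites such as
 * SET ... PXAT can leave OBJ_ENCODING_INT objects in argv, which
 * string-based key extraction cannot handle. Decode them first. */
for (int i = 0; i < c->argc; i++) {
    if (c->argv[i]->encoding == OBJ_ENCODING_INT) {
        robj *decoded = getDecodedObject(c->argv[i]); /* new string robj */
        decrRefCount(c->argv[i]);
        c->argv[i] = decoded;
    }
}
/* ...getKeysFromCommand() can now safely treat argv entries as strings. */
```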
Signed-off-by: Binbin <[email protected]>
Let's take expire as an example. If the target node deleted an expired key, or the key logically expires while the slot RDB is being loaded, the source node may propagate the expire command after the slot RDB has been loaded. These expire commands become no-ops on the logically expired keys, resulting in data loss. During the slot migration process, keys are therefore not considered expired in the expiration check for commands propagated by the source node. Signed-off-by: Binbin <[email protected]>
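One way to picture that rule, assuming a hypothetical slotIsBeingImported() predicate; the actual check in the expiration path may be shaped differently:
```c
/* During slot import, never treat a key in an importing slot as expired
 * locally: the source node remains the authority and will propagate an
 * explicit expiry or deletion over the migration link. */
int keyIsExpiredForSlotSync(serverDb *db, robj *key) {
    int slot = keyHashSlot(key->ptr, sdslen(key->ptr));
    if (slotIsBeingImported(slot)) return 0; /* hypothetical predicate */
    return keyIsExpired(db, key);            /* normal lazy-expire check */
}
```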
Signed-off-by: Binbin <[email protected]>
The reply will mess up the slot replication. Signed-off-by: Binbin <[email protected]>
This causes the source node to not trim the reply, making the query buffer of the target node grow too large and resulting in the target node being disconnected due to the querybuf limit (1GB). Signed-off-by: Binbin <[email protected]>
We need to make sure this block is only executed on the source node; otherwise the target node calling it would reset the slot failover. Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
…flags Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
This can happen when multiple targets are doing slot sync at the same time. When doing disk-based slot RDB replication, this replicationSetupReplicaForFullResync call may result in sending the slot RDB to a different target, sending it to a normal replica, or sending a normal RDB to a target. Signed-off-by: Binbin <[email protected]>
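A sketch of the kind of guard this fix implies when a snapshot child finishes: only attach waiters whose expected payload matches what the child produced. The CLIENT_SLOT_SYNC flag and list/constant names approximate the replication code and are illustrative:
```c
/* Called when a disk-based snapshot child finishes. child_was_slot_rdb
 * (illustrative) records whether that child produced a slot RDB or a
 * normal one; only matching waiters get attached to it. */
void attachWaitingReplicas(int child_was_slot_rdb) {
    listIter li;
    listNode *ln;
    listRewind(server.replicas, &li);
    while ((ln = listNext(&li)) != NULL) {
        client *replica = ln->value;
        if (replica->replstate != REPLICA_STATE_WAIT_BGSAVE_END) continue;
        int wants_slot_rdb = (replica->flags & CLIENT_SLOT_SYNC) != 0;
        if (wants_slot_rdb != child_was_slot_rdb) continue; /* wrong payload */
        replicationSetupReplicaForFullResync(replica, getPsyncInitialOffset());
    }
}
```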
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit that referenced this pull request on Aug 18, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in #1949. Signed-off-by: Binbin <[email protected]>
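The ownership rule from this fix, sketched below; only rdbSnapshotOptions, conns, and zfree() come from the message above, and the function shape and early-exit condition are illustrative:
```c
/* options.conns is malloc'ed by the caller but owned by this function
 * once passed in, so every early-return path must release it. */
int rdbSaveSnapshotToConns(rdbSnapshotOptions options) {
    if (hasActiveChildProcess()) { /* example early-exit condition */
        zfree(options.conns);      /* don't leak the caller's allocation */
        return C_ERR;
    }
    /* ...fork and stream the snapshot, freeing conns on completion... */
    return C_OK;
}
```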
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request on Aug 19, 2025:
Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node, which pushes an AOF-formatted snapshot of the slots to the target, followed by a replication stream of changes on those slots (a la manual failover). This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at valkey-io#1591, modified to meet the designs we had outlined in valkey-io#23.

## New commands
* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE node-id`: Begin sending the slots via replication to the target. Multiple targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model. Later, "two phase" semantics (e.g. slot-level replicate/failover commands) can be added on the same core.

## Slot migration jobs
Introduces the concept of a slot migration job. While active, a job tracks a connection created by the source to the target, over which the contents of the slots are sent. This connection is used for control messages as well as replicated slot data. Each job is given a 40-character random name to help uniquely identify it. All jobs, including those that finished recently, can be observed using the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication
* Since the snapshot uses AOF format, it can be replayed verbatim to any replicas of the target node.
* We use the same proxying mechanism used for chained replication to copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`
To coordinate the state machine transitions across the two nodes, a new command, `CLUSTER SYNCSLOTS`, performs this control flow. Each end of the slot migration connection is expected to install a read handler in order to handle `CLUSTER SYNCSLOTS` commands:
* `ESTABLISH`: Begins a slot migration. Provides slot migration information to the target and authorizes the connection to write to unowned slots.
* `SNAPSHOT-EOF`: Appended to the end of the snapshot to signal that the snapshot is done being written to the target.
* `PAUSE`: Informs the source node to pause whenever it gets the opportunity.
* `PAUSED`: Added to the end of the client output buffer when the pause is performed. The pause is only performed after the buffer shrinks below a configurable size.
* `REQUEST-FAILOVER`: Requests that the source grant or deny a failover for the slot migration. The failover is only granted if the target is still paused. Once a failover is granted, the pause is refreshed for a short duration.
* `FAILOVER-GRANTED`: Sent to the target to inform it that REQUEST-FAILOVER was granted.
* `ACK`: Heartbeat command used to ensure liveness.

## Interactions with other commands
* FLUSHDB on the source node (which flushes the migrating slots) will result in the source dropping the connection, which will flush the slots on the target and reset the state machine back to the beginning. The subsequent retry should succeed very quickly (the slots are now empty).
* FLUSHDB on the target will fail the slot migration. We can iterate with better handling, but for now it is expected that the operator would retry.
* Generally, FLUSHDB is expected to be executed cluster-wide, so preserving partially migrated slots doesn't make much sense.
* SCAN and KEYS are filtered to avoid exposing importing slot data.

## Error handling
* On any transient connection drop, the migration is failed and the user must retry.
* If there is an OOM while reading from the import connection, we fail the import, which drops the importing slot data.
* If a client output buffer limit is reached on the source node, it drops the connection, which causes the migration to fail.
* If at any point the export loses ownership, or either node is failed over, a callback is triggered on both ends of the migration to fail the import. The import will not reattempt with a new owner.
* The two ends of the migration routinely ping each other with SYNCSLOTS ACK messages. If there is no interaction on the connection for longer than `repl-timeout`, the connection is dropped, resulting in migration failure.
* If a failover happens, we drop keys in all unowned slots. The migration does not persist through failovers and would need to be retried on the new source/target.

## State machine
```
Target/Importing Node State Machine
-----------------------------------
SLOT_IMPORT_WAIT_ACK
  -- ACK --> SLOT_IMPORT_RECEIVE_SNAPSHOT
  -- SNAPSHOT-EOF --> SLOT_IMPORT_WAITING_FOR_PAUSED
  -- PAUSED --> SLOT_IMPORT_FAILOVER_REQUESTED
  -- FAILOVER-GRANTED --> SLOT_IMPORT_FAILOVER_GRANTED
  -- Takeover performed --> SLOT_MIGRATION_JOB_SUCCESS

On any error condition, the job moves to
SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP
  -- Unowned slots cleaned up --> SLOT_MIGRATION_JOB_FAILED

Error conditions:
  1. OOM                      4. FLUSHDB
  2. Slot ownership change    5. Connection lost
  3. Demotion to replica      6. No ACK from source (timeout)

Source/Exporting Node State Machine
-----------------------------------
SLOT_EXPORT_CONNECTING
  -- Connected --> SLOT_EXPORT_AUTHENTICATING
  -- Authenticated --> SLOT_EXPORT_SEND_ESTABLISH
  -- ESTABLISH command written --> SLOT_EXPORT_READ_ESTABLISH_RESPONSE
  -- Full response read (+OK) --> SLOT_EXPORT_WAITING_TO_SNAPSHOT
  -- No other child process --> SLOT_EXPORT_SNAPSHOTTING
  -- Snapshot done --> SLOT_EXPORT_STREAMING
  -- PAUSE --> SLOT_EXPORT_WAITING_TO_PAUSE
  -- Buffer drained --> SLOT_EXPORT_FAILOVER_PAUSED
  -- Failover request granted --> SLOT_EXPORT_FAILOVER_GRANTED
  -- New topology received --> SLOT_MIGRATION_JOB_SUCCESS

On any error condition the job moves to SLOT_MIGRATION_JOB_FAILED;
a user cancellation moves it to SLOT_MIGRATION_JOB_CANCELLED.

Error conditions:
  1. User sends CANCELMIGRATION    7. ERR from ESTABLISH command
  2. Slot ownership change         8. Unpaused before failover completed
  3. Demotion to replica           9. Snapshot failed (e.g. child OOM)
  4. FLUSHDB                      10. No ack from target (timeout)
  5. Connection lost              11. Client output buffer overrun
  6. AUTH failed
```
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
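A usage sketch of the commands above; the port, the slot range, and the node-id placeholder are illustrative, while the command forms come from the description:
```
# Start migrating slots 0-5460 from this node to the given target
valkey-cli -p 6379 CLUSTER MIGRATESLOTS SLOTSRANGE 0 5460 NODE <target-node-id>

# Inspect active and recently finished migration jobs
valkey-cli -p 6379 CLUSTER GETSLOTMIGRATIONS

# Abort all in-flight migrations
valkey-cli -p 6379 CLUSTER CANCELMIGRATION ALL
```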
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request on Aug 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
asagege pushed a commit to asagege/valkey that referenced this pull request on Aug 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
hwware pushed a commit that referenced this pull request on Aug 22, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in #1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request on Aug 26, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request on Aug 29, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in #1949. Signed-off-by: Binbin <[email protected]>
enjoy-binbin added a commit that referenced this pull request on Aug 29, 2025:
…ous reading of auth response (#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
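A sketch of that split; the state names come from the commit message, while the handler plumbing and helper names are illustrative:
```c
/* Fragment of the job state machine dispatch. Non-blocking AUTH: write
 * the command, then let the event loop call us back when the reply
 * arrives instead of blocking on a synchronous read. */
case SLOT_EXPORT_SEND_AUTH:
    connWrite(job->conn, auth_cmd, sdslen(auth_cmd));
    connSetReadHandler(job->conn, slotMigrationReadHandler); /* illustrative */
    job->state = SLOT_EXPORT_READ_AUTH_RESPONSE;
    break;

case SLOT_EXPORT_READ_AUTH_RESPONSE:
    /* Invoked from the read handler once data is available. */
    if (readLineFromConn(job->conn, buf, sizeof(buf)) == C_OK) /* illustrative */
        job->state = (buf[0] == '+') ? SLOT_EXPORT_SEND_ESTABLISH
                                     : SLOT_MIGRATION_JOB_FAILED;
    break;
```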
enjoy-binbin added a commit that referenced this pull request on Sep 18, 2025:
When we added atomic slot migration in #1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process (see the sketch below). Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
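A sketch of that diskless-style transfer; the pipe-and-fork shape mirrors the description above, and writeSlotSnapshot() and relaySnapshotToTarget() are illustrative names:
```c
/* The child serializes the migrating slots into a pipe; the parent
 * relays the bytes to the target over the event loop, so no temporary
 * RDB file ever touches the disk. */
int pipefd[2];
if (pipe(pipefd) == -1) return C_ERR;

pid_t childpid = fork();
if (childpid == 0) {
    /* Child: write the slot snapshot into the pipe and exit. */
    close(pipefd[0]);
    writeSlotSnapshot(pipefd[1], job->slot_ranges); /* hypothetical */
    _exit(0);
} else {
    /* Parent: forward pipe output to the target connection as it arrives. */
    close(pipefd[1]);
    aeCreateFileEvent(server.el, pipefd[0], AE_READABLE,
                      relaySnapshotToTarget, job); /* hypothetical callback */
}
```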
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
…ous reading of auth response (valkey-io#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in valkey-io#1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request on Sep 19, 2025:
When we added atomic slot migration in valkey-io#1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process. Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in #1949. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
…ous reading of auth response (#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]>
rjd15372 pushed a commit that referenced this pull request on Sep 23, 2025:
When we added atomic slot migration in #1949, we reused a lot of the RDB save code. That was the easier way to implement ASM at first, but it comes with side effects: we use CHILD_TYPE_RDB to do the fork and the rdb.c/rdb.h functions to save the snapshot, which messes up the logs (we print messages saying we are doing RDB work) and the info fields (we report rdb_bgsave_in_progress while actually doing slot migration). It also makes the code difficult to maintain: the rdb_save path uses a lot of rdb_* variables even though we are doing slot migration. If we want to support one fork with multiple target nodes, we need to rewrite this code for a cleaner separation. Note that the changes to rdb.c/rdb.h revert the earlier changes made when we were reusing this code for slot migration. The slot migration snapshot logic is similar to diskless replication: we use a pipe to transfer the snapshot data from the child process to the parent process. Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in the CLUSTER GETSLOTMIGRATIONS command.
- The slot migration fork is also tracked under the cluster latency class.
Signed-off-by: Binbin <[email protected]> Signed-off-by: Jacob Murphy <[email protected]> Co-authored-by: Jacob Murphy <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
Introduces a new family of commands for migrating slots via replication; the full commit message is quoted above. Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
We now pass rdbSnapshotOptions options into this function, and options.conns is now malloc'ed on the caller side, so we need to zfree it when returning early due to an error. Previously, conns was malloc'ed after the error handling, so we didn't have this issue. Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
This may result in a meaningless slot migration job; we should return an error to the user in advance to avoid operational errors. Also, `by myself` is not correct English grammar and `myself` is an internal code term, so it was changed to `by this node`. Was introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
If all cluster nodes have functions, slot migration will fail, since the target will return a "function already exists" error when doing the FUNCTION LOAD. In addition, the target's replica could panic when it executes the FUNCTION LOAD propagated from the primary (see propagation-error-behavior). Introduced in valkey-io#1949. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
hpatro pushed a commit to hpatro/valkey that referenced this pull request on Oct 3, 2025:
…ous reading of auth response (valkey-io#2494)
In the old SLOT_EXPORT_AUTHENTICATING state added in valkey-io#1949, the source node sends the AUTH command and then synchronously reads the response. If the target node is blocked during this process, the source node will also block. We should use a read handler to handle this, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue. Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]>
Labels
cluster
needs-doc-pr: This change needs to update a documentation page. Remove the label once the doc PR is open.
release-notes: This issue should get a line item in the release notes.
run-extra-tests: Run extra tests on this PR (runs all tests from daily except valgrind and RESP).
Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node which pushes an AOF formatted snapshot of the slots to the target, followed by a replication stream of changes on that slot (a la manual failover).
This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at #1591, modified to meet the designs we had outlined in #23.
Closes #23.
Co-authored-by: Binbin [email protected]